Skip to content

Conversation

@jgallagher
Copy link
Contributor

This builds on the BlueprintExpungedZoneAccessReason added in #9608, and adds a new set of zone IDs to PlanningInput for "expunged and unreferenced" zones. These are zones that the planner is safe to prune in a future planning iteration, because they are:

  1. Expunged and ready for cleanup
  2. No longer referenced in CRDB, implying any necessary cleanup is done

This PR only adds the new set of zones to PlanningInput. It will be nearly trivial to add a step to the planner to "prune all zones present in the planning input's expunged_and_unreferenced set", but out of an abundance of caution I'd prefer to land this first, collect a reconfigurator state from some deployed racks, and confirm the contents match what we'd expect on some real systems.

This is a little weird in that the logic for whether a zone can be pruned lives in the "prepare a planning input" crate rather than the planner itself, but I think this is reasonable given the implementation of several of the checks (e.g., "query CRDB for this expunged zone specifically" to see whether it's still referenced, which the planner can't do since it doesn't have access to CRDB). Happy to discuss alternatives if this seems wrong upon review.

(We may also want to wait on #7278 before adding the actual prune step?)

// We have no way to confirm that this zone is "unreferenced" - that's a
// property of the system at large, mostly CRDB state - but we can
// confirm that it's expunged and ready for cleanup by looking at the
// parent blueprint.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was originally going to add a couple blippy lints that any zone in expunged_and_unreferenced is (a) present and (b) actually expunged, but with the guard in this method and the fact that PlanningInput's fields are private, I don't believe it's actually possible to construct a PlanningInput with any such zone IDs in expunged_and_unreferenced, so it was impossible to write a test to confirm the blippy lints triggered.

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This is really clean. I've commented on one possible bug here and the rest is nitty feedback.

.map_err(|e| public_error_from_diesel(e, ErrorHandler::Server))
}

/// many queries as needed (in batches) to get them all
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a stray line

//! Helpers for identifying when expunged zones are no longer referenced in the
//! database.
//!
//! When a zone is expunged, the expunged zone is still present in the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//! When a zone is expunged, the expunged zone is still present in the
//! When a zone is first expunged, the expunged zone initially remains in the

//! blueprint. Two high-level conditions must be satisfied before it's safe to
//! prune an expunged zone from the blueprint (i.e., delete it entirely):
//!
//! 1. The sled-agent responsible for running the zone must confirm the zone is
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think item 1 here is an "or" of either (1) what you have here, or (2) the sled is expunged.

//! 2. Any cleanup work that operates on expunged zones must be complete. This
//! is zone-type-specific. Some zone types have no cleanup work at all and
//! can be pruned as soon as the first condition is satisfied. Others have
//! multiple, disparate cleanup work, all of which must be completed.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//! multiple, disparate cleanup work, all of which must be completed.
//! multiple, disparate cleanup steps, all of which must be completed.

use std::net::IpAddr;
use std::num::NonZeroU32;

pub(super) async fn find_expunged_and_unreferenced_zones(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels important to document this function.

}
}

fn in_service_boundary_ntp_upstream_configs(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having looked at how this is used: why not compute these two sets once in find_expunged_and_unreferenced_zones() and then pass them to the two helpers that use this object?

This isn't bad or anything -- it just took me a minute to figure out what was going on with it and I think in the end it's simpler than it looks.

// it's still "referenced" (in that the planner needs to refer to it to set
// up a new boundary NTP zone).
let expunged_config = BoundaryNtpUpstreamConfig::new(boundary_ntp);
bp_refs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks backwards to me. This returns whether an in-service zone has the same upstream NTP config as the expunged zone. That's the case where the zone is not referenced any more, right?

(If that's right, there's probably also a test gap here.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, great catch.

(If that's right, there's probably also a test gap here.)

Agreed - I was planning on manually testing this first by grabbing reconfigurator-cli state from a system (once this landed) and then figuring out where those tests might live. I don't think putting them in reconfigurator-cli (or using the underlying simulator) will work, because those don't have access to a live CRDB, and the majority of logic in this module requires querying for not-blueprint bits in the db. Maybe just unit tests here, and know that there will be quite a lot of boilerplate to test all these cases?

Comment on lines +184 to +185
clickhouse_config.keepers.contains_key(&zone_id)
|| clickhouse_config.servers.contains_key(&zone_id)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know enough about this to evaluate this case -- may want to ask @karencfv or @andrewjstone to double-check.

Comment on lines +192 to +196
// Check BlueprintExpungedZoneAccessReason::ExternalDnsExternalIps; we
// consider an external DNS zone "still referenced" if its IP is _not_
// assigned to an in-service external DNS zone. (If the IP _is_ assigned to
// an in-service external DNS zone, the expunged zone is no longer
// referenced and can be safely pruned.)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General nit and feel free to ignore: I'd suggest instead of framing all these in terms of "referenced", framing them in terms of "needed", and then changing a comment like this to say "We only need to keep track of an expunged external DNS zone if its external IP has not yet been re-used (because this is currently the only way we keep track of which external IPs the external DNS zones should be using)."

That is: "referenced" feels like a special case of "needed". It applies to Oximeter and multi-node Clickhouse and potentially others. But it seems a little wrong to shoehorn a case like this into the term "referenced".

(Obviously not a big deal)

Comment on lines +208 to +211
if !datastore
.database_nexus_access_all(opctx, &BTreeSet::from([zone_id]))
.await?
.is_empty()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we already load this information separately for the planning input?

(Not sure it's worth optimizing if we do, but figured I'd mention it. I don't think there's a correctness issue if we get different results? But worth being sure.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite - we only load rows in the DbMetadataNexusState::{Active,NotYet} states. But this check needs to confirm there aren't rows in any state.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants